{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tabular Datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this guide we will explore how to work with tabular data in HoloViews. Tabular data has a fixed list of column headings, with values stored in an arbitrarily long list of rows.  Spreadsheets, relational databases, CSV files, and many other typical data sources fit naturally into this format.  HoloViews defines an extensible system of interfaces to load, manipulate, and visualize this kind of data, as well as allowing conversion of any of the non-tabular data types into tabular data for analysis or data interchange.\n",
    "\n",
    "By default HoloViews will use one of these data storage formats for tabular data:\n",
    "\n",
    "* A pure Python dictionary containing 1D NumPy-arrays for each column.\n",
    "* A purely NumPy array format for numeric data.\n",
    "* Pandas DataFrames\n",
    "* Dask DataFrames"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import holoviews as hv\n",
    "hv.extension('bokeh', 'matplotlib')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Simple Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Usually when working with data we have one or more independent variables, taking the form of categories, labels, discrete sample coordinates, or bins.  We refer to these independent variables as key dimensions (or ``kdims`` for short) in HoloViews. The observer or dependent variables, on the other hand, are referred to as value dimensions (``vdims``), and are ordinarily measured or calculated given the independent variables. The simplest useful form of a ``Dataset`` object is therefore a column 'x' and a column 'y' corresponding to the key dimensions and value dimensions respectively. An obvious visual representation of this data is a ``Table``:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xs = range(10)\n",
    "ys = np.exp(xs)\n",
    "\n",
    "table = hv.Table((xs, ys), kdims=['x'], vdims=['y'])\n",
    "table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, this data has many more meaningful visual representations, and therefore the first important concept is that ``Dataset`` objects can be converted to other objects as long as their dimensionality allows it, meaning that you can easily create the different objects from the same data (and cast between the objects once created):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hv.Scatter(table) + hv.Curve(table) + hv.Bars(table)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each of these three plots uses the same data, but represents a different assumption about the semantic meaning of that data -- the ``Scatter`` plot is appropriate if that data consists of independent samples, the ``Curve`` plot is appropriate for samples chosen from an underlying smooth function, and the ``Bars`` plot is appropriate for independent categories of data.  Since all these plots have the same dimensionality, they can easily be converted to each other, but there is normally only one of these representations that is semantically appropriate for the underlying data.  For this particular data, the semantically appropriate choice is ``Curve``, since the *y* values are samples from the continuous function ``exp``.\n",
    "\n",
    "As a guide to which Elements can be converted to each other, those of the same dimensionality here should be interchangeable, because of the underlying similarity of their columnar representation:\n",
    "\n",
    "* 0D: BoxWhisker, Spikes, Distribution*, \n",
    "* 1D: Scatter, Curve, ErrorBars, Spread, Bars, BoxWhisker, Regression*\n",
    "* 2D: Points, HeatMap, Bars, BoxWhisker, Bivariate*\n",
    "* 3D: Scatter3D, Trisurface, VectorField, BoxWhisker, Bars\n",
    "\n",
    "\\* - requires Seaborn\n",
    "\n",
    "This categorization is based only on the ``kdims``, which define the space in which the data has been sampled or defined. An Element can also have any number of value dimensions (``vdims``), which may be mapped onto various attributes of a plot such as the color, size, and orientation of the plotted items.  For a reference of how to use these various Element types, see the [Elements Reference](http://holoviews.org/reference/index.html#elements)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data types and Constructors\n",
    "\n",
    "As discussed above, ``Dataset`` provides an extensible interface to store and operate on data in different formats. All interfaces support a number of standard constructors."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Storage formats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dataset types can be constructed using one of three supported formats, (a) a dictionary of columns, (b) an NxD array with N rows and D columns, or (c) pandas dataframes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(hv.Scatter({'x': xs, 'y': ys}) +\n",
    "      hv.Scatter(np.column_stack([xs, ys])) +\n",
    "      hv.Scatter(pd.DataFrame({'x': xs, 'y': ys})))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Literals\n",
    "\n",
    "In addition to the main storage formats, Dataset Elements support construction from three Python literal formats: (a) An iterator of y-values, (b) a tuple of columns, and (c) an iterator of row tuples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(hv.Scatter(ys) + hv.Scatter((xs, ys)) + hv.Scatter(zip(xs, ys)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For these inputs, the data will need to be copied to a new data structure, having one of the three storage formats above.  By default Dataset will try to construct a simple array, falling back to either pandas dataframes (if available) or the dictionary-based format if the data is not purely numeric. Additionally, the interfaces will try to maintain the provided data's type, so numpy arrays and pandas DataFrames will always be parsed first by their respective array and dataframe interfaces."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({'x': xs, 'y': ys, 'z': ys*2})\n",
    "print(type(hv.Scatter(df).data))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dataset will attempt to parse the supplied data, falling back to each consecutive interface if the previous could not interpret the data. The default list of fallbacks and simultaneously the list of allowed datatypes is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hv.Dataset.datatype"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note these include grid based datatypes, which are covered in [Gridded Datasets](http://holoviews.org/user_guide/Gridded_Datasets.html). To select a particular storage format explicitly, supply one or more allowed datatypes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(type(hv.Scatter((xs, ys), datatype=['array']).data))\n",
    "print(type(hv.Scatter((xs, ys), datatype=['dictionary']).data))\n",
    "print(type(hv.Scatter((xs, ys), datatype=['dataframe']).data))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Sharing Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the formats with labelled columns do not require any specific order, each Element can effectively become a view into a single set of data. By specifying different key and value dimensions, many Elements can show different values, while sharing the same underlying data source."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "overlay = hv.Scatter(df, kdims='x', vdims='y') * hv.Scatter(df, kdims='x', vdims='z')\n",
    "overlay"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can quickly confirm that the data is actually shared:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "overlay.Scatter.I.data is overlay.Scatter.II.data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For columnar data, this approach is much more efficient than creating copies of the data for each Element, and allows for some advanced features like linked brushing in the [Bokeh backend](./Plotting_with_Bokeh.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Converting to raw data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Column types make it easy to export the data to the three basic formats: arrays, dataframes, and a dictionary of columns.\n",
    "\n",
    "###### Array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "table.array()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Pandas DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "table.dframe().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Dataset dictionary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "table.columns()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Creating tabular data from Elements using the .table and .dframe methods\n",
    "\n",
    "If you have data in some other HoloViews element and would like to use the columnar data features, you can easily tabularize any of the core Element types into a ``Table`` Element, using the ``.table()`` method.  Similarly, the ``.dframe()`` method will convert an Element into a pandas DataFrame. These methods are very useful if you want to then transform the data into a different Element type, or to perform different types of analysis.\n",
    "\n",
    "## Tabularizing simple Elements\n",
    "\n",
    "For a simple example, we can create a ``Curve`` of an exponential function and convert it to a ``Table`` with the ``.table`` method, with the same result as creating the Table directly from the data as done earlier in this user guide:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xs = np.arange(10)\n",
    "curve = hv.Curve(zip(xs, np.exp(xs)))\n",
    "curve * hv.Scatter(zip(xs, curve)) + curve.table()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, we can get a pandas dataframe of the Curve using ``curve.dframe()``:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "curve.dframe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although 2D image-like objects are *not* inherently well suited to a flat columnar representation, serializing them by converting to tabular data is a good way to reveal the differences between Image and Raster elements.  Rasters are a very simple type of element, using array-like integer indexing of rows and columns from their top-left corner as in computer graphics applications.  Conversely, Image elements are a higher-level abstraction that provides a general-purpose continuous Cartesian coordinate system, with x and y increasing to the right and upwards as in mathematical applications, and each point interpreted as a sample representing the pixel in which it is located (and thus centered within that pixel).  Given the same data, the ``.table()`` representation will show how the data is being interpreted (and accessed) differently in the two cases (as explained in detail in the [Continuous Coordinates](Continuous_Coordinates.ipynb) user guide):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%opts Points (s=200) [size_index=None]\n",
    "extents = (-1.6,-2.7,2.0,3)\n",
    "np.random.seed(42)\n",
    "mat = np.random.rand(3, 3)\n",
    "\n",
    "img = hv.Image(mat, bounds=extents)\n",
    "raster = hv.Raster(mat)\n",
    "\n",
    "img    * hv.Points(img)    + img.table() + \\\n",
    "raster * hv.Points(raster) + raster.table()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tabularizing space containers\n",
    "\n",
    "Even deeply nested objects can be deconstructed in this way, serializing them to make it easier to get your raw data out of a collection of specialized ``Element`` types. Let's say we want to make multiple observations of a noisy signal. We can collect the data into a ``HoloMap`` to visualize it and then call ``.table()`` to get a columnar object to which we can apply operations or transformations to other ``Element`` types. Deconstructing nested data in this way only works if the data is homogeneous. In practical terms this requires that your data structure contains Elements (of any type) held in these Container types: ``NdLayout``, ``GridSpace``, ``HoloMap``, and ``NdOverlay``, with all dimensions consistent throughout (so that they can all fit into the same set of columns). To read more about these containers see the [Dimensioned Containers](./Dimensioned_Containers.ipynb) guide.\n",
    "\n",
    "Let's now go back to the ``Image`` example. We will collect a number of observations of some noisy data into a ``HoloMap`` and display it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "obs_hmap = hv.HoloMap({i: hv.Image(np.random.randn(10, 10), bounds=(0,0,3,3))\n",
    "                   for i in range(3)}, key_dimensions=['Observation'])\n",
    "obs_hmap"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can serialize this data just as before, where this time we get a four-column (4D) table. The key dimensions of both the HoloMap and the Images, as well as the z-values of each ``Image``, are all merged into a single table. We can visualize the samples we have collected by converting it to a ``Scatter3D`` object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%output backend='matplotlib'\n",
    "%%opts Layout [fig_size=150] Scatter3D [color_index=3 size_index=None] (cmap='hot' edgecolor='k' s=50)\n",
    "obs_hmap.table().to.scatter3d() + obs_hmap.table()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here the `z` dimension is shown by color, as in the original images, and the other three dimensions determine where the datapoint is shown in 3D. This way of deconstructing objects will work for any data structure that satisfies the conditions described above, no matter how nested. If we vary the amount of noise while continuing to performing multiple observations, we can create an ``NdLayout`` of HoloMaps, one for each noise level, and animated by the observation number."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%opts Image [width=225 height=225]\n",
    "from itertools import product\n",
    "extents = (0,0,3,3)\n",
    "error_hmap = hv.HoloMap({(i, j): hv.Image(j*np.random.randn(3, 3), bounds=extents)\n",
    "                         for i, j in product(range(3), np.linspace(0, 1, 3))},\n",
    "                        key_dimensions=['Observation', 'noise'])\n",
    "noise_layout = error_hmap.layout('noise')\n",
    "noise_layout"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And again, we can easily convert the object to a ``Table``:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "noise_layout.table()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Applying operations to the data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Sorting by columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once data is in columnar form, it is simple to apply a variety of operations.  For instance, Dataset can be sorted by their dimensions using the ``.sort()`` method.  By default, this method will sort by the key dimensions in an ascending order, but any other dimension(s) can be sorted by providing them as an argument list to the sort method. The ``reverse`` argument also allows sorting in descending order:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bars = hv.Bars((['C', 'A', 'B', 'D'], [2, 7, 3, 4]))\n",
    "(bars +\n",
    " bars.sort().relabel('sorted') +\n",
    " bars.sort(['y']).relabel('y-sorted') +\n",
    " bars.sort(reverse=True).relabel('reverse sorted')).cols(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Working with categorical or grouped data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data is often grouped in various ways, and the Dataset interface provides various means to easily compare between groups and apply statistical aggregates. We'll start by generating some synthetic data with two groups along the x axis and 4 groups along the y axis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n = np.arange(1000)\n",
    "xs = np.repeat(range(2), 500)\n",
    "ys = n%4\n",
    "zs = np.random.randn(1000)\n",
    "table = hv.Table((xs, ys, zs), kdims=['x', 'y'], vdims=['z'])\n",
    "table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since there are repeat observations of the same x- and y-values, we may want to reduce the data before we display it or else use a datatype that supports plotting distributions in this way. The ``BoxWhisker`` type allows doing exactly that:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hv.BoxWhisker(table)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Aggregating/Reducing dimensions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most types require the data to be non-duplicated before being displayed.  For this purpose, HoloViews makes it easy to ``aggregate`` and ``reduce`` the data. These two operations are simple complements of each other--aggregate computes a statistic for each group in the supplied dimensions, while reduce combines all the groups except the supplied dimensions. Supplying only a function and no dimensions will simply aggregate or reduce all available key dimensions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%opts Bars {+axiswise}\n",
    "hv.Bars(table).aggregate('x', function=np.mean) + hv.Bars(table).reduce(x=np.mean)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(**A**) aggregates over both the x and y dimension, computing the mean for each x/y group, while (**B**) reduces the x dimension leaving just the mean for each group along y."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Collapsing multiple Dataset Elements"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When multiple observations are broken out into a ``HoloMap`` they can easily be combined using the ``collapse`` method. Here we create a number of Curves with increasingly larger y-values. By collapsing them with a ``function`` and a ``spreadfn`` we can compute the mean curve with a confidence interval. We then simply cast the collapsed ``Curve`` to a ``Spread`` and ``Curve`` Element to visualize them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hmap = hv.HoloMap({i: hv.Curve(np.arange(10)*i) for i in range(10)})\n",
    "collapsed = hmap.collapse(function=np.mean, spreadfn=np.std)\n",
    "hv.Spread(collapsed) * hv.Curve(collapsed) + collapsed.table()"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
